0. Introduction

This document lists the exploratory data analysis, model build and analysis for Coasting on Couches, the term project for MGT 6203 Spring semester. Whilst the project document and slides/ presentation list a more human-readable version, this document combines some exposition with a lot of code to generate graphs, to put it simply.

0.1 Library Setup

The document will be shared as a Shiny app and in the raw RMd format.

Please run the following chunk to ensure all the necessary libraries are installed/ present should you wish to execute the chunks at your end. (Idea taken from this blogpost)

Please execute install.packages("pls") in the console if it’s the first time you’re running it. There could be some additional installations, depending on your operating system.

1. Loading Data

We had downloaded data from InsideAirbnb.com to the data folder. This will be available on our GitHub repo.

The data is in four parts (each city has all five elements): 1. listing: These are the actual Airbnb listings. The columns are defined here. This is set of 74 variables pertaining to a specific listing. 2. review: List of reviews per row in the listing table. 3. calendar: Price of a particular listing on a particular date, along with min-max nights for hire 4. neighbourhoods: List of neighbourhoods screened in the city 5. map: GeoJson shapefile showing district boundaries.

1.1 Raw

We will read all the data into dataset variables.

#Singapore
listing.sin <- read.csv("./data/SIN_listings.csv")
reviews.sin <- read.csv("./data/SIN_reviews.csv")
calendar.sin <- read.csv("./data/SIN_calendar.csv")
neighbourhoods.sin <- read.csv("./data/SIN_neighbourhoods.csv")
map.sin <- geojson_read("./data/SIN_neighbourhoods.geojson")

#Taipei
listing.tpe <- read.csv("./data/TPE_listings.csv")
reviews.tpe <- read.csv("./data/TPE_reviews.csv")
calendar.tpe <- read.csv("./data/TPE_calendar.csv")
neighbourhoods.tpe <- read.csv("./data/TPE_neighbourhoods.csv")
map.tpe <- geojson_read("./data/TPE_neighbourhoods.geojson")

#Tokyo
listing.nrt <- read.csv("./data/NRT_listings.csv")
reviews.nrt <- read.csv("./data/NRT_reviews.csv")
calendar.nrt <- read.csv("./data/NRT_calendar.csv")
neighbourhoods.nrt <- read.csv("./data/NRT_neighbourhoods.csv")
map.nrt <- geojson_read("./data/NRT_neighbourhoods.geojson")

#Hong Kong
listing.hkg <- read.csv("./data/HKG_listings.csv")
reviews.hkg <- read.csv("./data/HKG_reviews.csv")
calendar.hkg <- read.csv("./data/HKG_calendar.csv")
neighbourhoods.hkg <- read.csv("./data/HKG_neighbourhoods.csv")
map.hkg <- geojson_read("./data/HKG_neighbourhoods.geojson")

1.2 Initial Data Analysis

1.2.0 Number of Listings

We first see the number of listings per city.

cities  <- c("Singapore", "Tokyo", "Taipei", "Hong Kong")
no_of_listings <- c(nrow(listing.sin), nrow(listing.nrt), nrow(listing.tpe), nrow(listing.hkg))
no_of_listings.fig <- plot_ly(
  x = cities,
  y = no_of_listings,
  type = "bar", 
  text = no_of_listings
)
no_of_listings.fig <- no_of_listings.fig %>% layout(title ="No of Listings Per City", yaxis = list(title="No of Listings"))
no_of_listings.fig

Clearly, Tokyo has the largest number of listings followed by Hong Kong, Taipei and then Singapore.

1.2.1 Listings by Neighbourhood - Choropleth Maps

Let us also consider heatmaps of where the listings are in each city by neighbourhood. Admittedly, the InsideAirbnb website has a map for each city (here’s one for Taipei), but this does not show by district. At a later stage, this can be enhanced to see by variable or time-series.

generate_choropleth_by_city <- function (listing, map, city_name)
{
  listings_by_neighbourhood <- listing %>%
                                count(neighbourhood_cleansed) 
  # neighbourhoods_zero <- neighbourhoods %>%
  #                       filter(!neighbourhood %in% listings_by_neighbourhood$neighbourhood) %>%
  #                       mutate(n = 0) %>%
  #                       select(neighbourhood, n)
  # listings_by_neighbourhood <- union(listings_by_neighbourhood, neighbourhoods_zero)
  # print(listings_by_neighbourhood)
  g <- list (
    fitbounds = "locations",
    visible = FALSE
  )
  fig <- plot_ly()
  fig <- fig %>% add_trace(
    type="choropleth",
    geojson=map,
    locations=listings_by_neighbourhood$neighbourhood_cleansed,
    z=listings_by_neighbourhood$n,
    colorscale="jet",
    featureidkey="properties.neighbourhood"
  )
  
  fig <- fig %>% layout(
    geo = g
  )
  fig <- fig %>% colorbar(title = "No of listings")
  fig <- fig %>% layout(
    title = paste0("Listings by Neighbourhood - ", city_name)
  )
  fig
}
generate_choropleth_by_city(listing.sin, map.sin, "Singapore")
generate_choropleth_by_city(listing.nrt, map.nrt, "Tokyo")
generate_choropleth_by_city(listing.hkg, map.hkg, "Hong Kong")
generate_choropleth_by_city(listing.tpe, map.tpe, "Taipei")
bin_districts <- function(listing, bins)
{
  district_bins <- listing %>%
                    count(neighbourhood_cleansed) %>%
                    arrange(desc(n))%>%
                    mutate(nb_group = ntile(n,n=bins)) %>%
                    arrange(desc(nb_group))
  return(district_bins)
}

bin_districts(listing.sin, 4)
##     neighbourhood_cleansed   n nb_group
## 1                  Kallang 405        4
## 2                  Geylang 340        4
## 3            Downtown Core 317        4
## 4                   Outram 309        4
## 5                   Rochor 270        4
## 6                   Novena 235        4
## 7                    Bedok 186        4
## 8              Bukit Merah 178        4
## 9             River Valley 149        4
## 10              Queenstown 143        4
## 11         Singapore River 108        4
## 12                 Tanglin  90        3
## 13                 Orchard  80        3
## 14                Clementi  72        3
## 15             Jurong East  71        3
## 16             Jurong West  59        3
## 17           Marine Parade  57        3
## 18                  Newton  57        3
## 19             Bukit Timah  56        3
## 20               Woodlands  44        3
## 21                 Hougang  43        3
## 22               Toa Payoh  40        3
## 23                  Bishan  39        2
## 24               Serangoon  35        2
## 25               Pasir Ris  33        2
## 26             Bukit Batok  31        2
## 27                Tampines  30        2
## 28              Ang Mo Kio  26        2
## 29               Sembawang  26        2
## 30                Sengkang  19        2
## 31                  Yishun  19        2
## 32        Southern Islands  18        2
## 33                 Punggol  17        2
## 34           Choa Chu Kang  17        1
## 35                  Museum  15        1
## 36           Bukit Panjang  14        1
## 37 Central Water Catchment  12        1
## 38            Marina South   3        1
## 39 Western Water Catchment   3        1
## 40                  Mandai   2        1
## 41            Lim Chu Kang   1        1
## 42                 Pioneer   1        1
## 43            Sungei Kadut   1        1
## 44                    Tuas   1        1

1.2.2 Listings by Neighbourhood - Bar Charts

We could also see this as barcharts.

bar_charts_by_neighbourhood <- function (listing, city_name, neighbourhoods)
{
  listings_by_neighbourhood <- listing %>%
                                count(neighbourhood_cleansed) %>%
                                # rename(neighbourhood = neighbourhood_cleansed)
                                arrange(desc(n))
  # print(listings_by_neighbourhood)
  neighbourhoods_zero <- neighbourhoods %>%
                        filter(!neighbourhood %in% listings_by_neighbourhood$neighbourhood_cleansed) %>%
                        rename(neighbourhood_cleansed = neighbourhood) %>%
                        mutate(n = 0) %>%
                        select(neighbourhood_cleansed, n)
  print(neighbourhoods_zero)
  listings_by_neighbourhood <- union(listings_by_neighbourhood, neighbourhoods_zero)
  # print(listings_by_neighbourhood)
  fig<- plot_ly(y=listings_by_neighbourhood$neighbourhood_cleansed, x=listings_by_neighbourhood$n, type="bar", orientation="h") %>%
        layout(yaxis=list(categoryorder = "total ascending"), title=paste("Listings per neighbourhood in", city_name))
  fig
}

bar_charts_by_neighbourhood(listing.sin, "Singapore", neighbourhoods.sin)
##    neighbourhood_cleansed n
## 1             Marina East 0
## 2            Straits View 0
## 3                  Changi 0
## 4              Changi Bay 0
## 5              Paya Lebar 0
## 6   North-Eastern Islands 0
## 7                 Seletar 0
## 8                 Simpang 0
## 9                Boon Lay 0
## 10                 Tengah 0
## 11        Western Islands 0
bar_charts_by_neighbourhood(listing.nrt, "Tokyo", neighbourhoods.nrt)
##    neighbourhood_cleansed n
## 1          Aogashima Mura 0
## 2               Fussa Shi 0
## 3           Hachijo Machi 0
## 4       Higashiyamato Shi 0
## 5            Hinode Machi 0
## 6           Hinohara Mura 0
## 7               Inagi Shi 0
## 8              Kiyose Shi 0
## 9          Kozushima Mura 0
## 10        Mikurajima Mura 0
## 11            Miyake Mura 0
## 12           Mizuho Machi 0
## 13           Niijima Mura 0
## 14         Ogasawara Mura 0
## 15           Oshima Machi 0
## 16           Toshima Mura 0
bar_charts_by_neighbourhood(listing.hkg, "Hong Kong", neighbourhoods.hkg)
## [1] neighbourhood_cleansed n                     
## <0 rows> (or 0-length row.names)
bar_charts_by_neighbourhood(listing.tpe, "Taipei", neighbourhoods.tpe)
## [1] neighbourhood_cleansed n                     
## <0 rows> (or 0-length row.names)

The majority of listings in Taipei, Tokyo and Hong Kong are in tourist-heavy districts. While the top district in Tokyo is Shinjuku, that in Hong Kong is Yau Tsim Mong, Kowloon’s core urban area formed by the combination of Yau Ma Tei, Tsim Sha Tsui and Mong Kok. Taipei’s top district is Zhongzheng district (“中正區”), consisting of historic sites and cultural performances.

In contrast to the other three cities, the top district in Singapore is the high-end residential condo district, Kallang. Not tourist heavy, but close to the downtown’s many attractions. The tourist-heavy Geylang, Downtown Core and Outram districts appear after Kallang.

Taipei and Hong Kong have listings in all districts. But 11 districts in Singapore (mostly military installations or the airport) and 16 districts in Tokyo do not have any listings.

1.2.3 Amenities

There’s a column called amenities in the dataset that appears to list all the self-reported amenities in the listing as a single comma-separated list. Let’s try to see this further.

For instance, here’s the longest list of amenities among Singapore listings.

listing_amenities.sin <- listing.sin %>%
                      mutate(amenities = str_replace(amenities, "\\[","")) %>%
                      mutate(amenities = str_replace(amenities, "\\]","")) %>%
                      mutate(amenities = str_replace_all(amenities, "\"","")) %>%
                      mutate(amenities = str_replace_all(amenities, ", " ,",")) %>%
                      mutate(amenities_list = as.list(strsplit(amenities, ","))) %>%
                      mutate(no_of_am = lengths(amenities_list)) %>%
                      mutate(Wifi = as.numeric(grepl('Wifi', amenities, fixed = TRUE))) %>%
                      mutate(Shampoo = as.numeric(grepl('Shampoo', amenities, fixed = TRUE))) %>%
                      mutate(Kitchen = as.numeric(grepl('Kitchen', amenities, fixed = TRUE))) 

# listing_amenities.sin %>% select(amenities, Wifi, Shampoo, Kitchen, Patio)

max_amenities.sin <- listing_amenities.sin %>%
                      select(amenities, no_of_am) %>%
                      group_by() %>%
                     slice(which.max(no_of_am))
amenities_list_string <- as.list(strsplit(as.character(max_amenities.sin["amenities"]), ","))
amenities_list_string
## [[1]]
##  [1] "Toaster"                         "Sound system"                   
##  [3] "Safe"                            "Indoor fireplace"               
##  [5] "Backyard"                        "Hangers"                        
##  [7] "Bed linens"                      "Hot water kettle"               
##  [9] "Freezer"                         "Coffee maker"                   
## [11] "Washer"                          "Cooking basics"                 
## [13] "Bathtub"                         "Hair dryer"                     
## [15] "Clothing storage"                "Oven"                           
## [17] "Outdoor furniture"               "Paid parking on premises"       
## [19] "High chair"                      "Children\\u2019s books and toys"
## [21] "Dedicated workspace"             "Crib"                           
## [23] "Dining table"                    "Free parking on premises"       
## [25] "Pool"                            "Cleaning products"              
## [27] "Wine glasses"                    "Cleaning before checkout"       
## [29] "Game console"                    "Long term stays allowed"        
## [31] "Drying rack for clothing"        "Outdoor dining area"            
## [33] "Private entrance"                "Elevator"                       
## [35] "Patio or balcony"                "Refrigerator"                   
## [37] "Dryer"                           "Microwave"                      
## [39] "Baby bath"                       "Gym"                            
## [41] "Wifi"                            "Children\\u2019s dinnerware"    
## [43] "Smoke alarm"                     "Board games"                    
## [45] "Luggage dropoff allowed"         "Shampoo"                        
## [47] "Breakfast"                       "Extra pillows and blankets"     
## [49] "Heating"                         "Conditioner"                    
## [51] "Cable TV"                        "Hot tub"                        
## [53] "Hot water"                       "Stove"                          
## [55] "Body soap"                       "BBQ grill"                      
## [57] "Iron"                            "Essentials"                     
## [59] "Babysitter recommendations"      "Pack \\u2019n play/Travel crib" 
## [61] "Kitchen"                         "Changing table"                 
## [63] "First aid kit"                   "Dishes and silverware"          
## [65] "Air conditioning"                "Shower gel"                     
## [67] "TV with standard cable"          "Fire extinguisher"
#"Shampoo,Kitchen,Long term stays allowed,Washer,Smart lock,Hair dryer,Dryer,Wifi,Hot water,TV,Air conditioning,Smoke alarm,Fire extinguisher"

Apropos nothing, we will use the following amenities as dummy variables for price: > “Shampoo,Kitchen,Long term stays allowed,Washer,Hair dryer,Wifi,Hot water,TV,Air conditioning”

1.2.4 Host Verifications

Similarly, let us also further analyse the column host_verifications to see if we can generate dummy variables from there as well.

listing_host_verf.sin <- listing.sin %>%
                      mutate(host_verifications = str_replace(host_verifications, "\\[","")) %>%
                      mutate(host_verifications = str_replace(host_verifications, "\\]","")) %>%
                      mutate(host_verifications = str_replace_all(host_verifications, "\"","")) %>%
                      mutate(host_verifications = str_replace_all(host_verifications, ", " ,",")) %>%
                      mutate(host_verifications_list = as.list(strsplit(host_verifications, ","))) %>%
                      mutate(no_of_vf = lengths(host_verifications_list))

max_verf.sin <- listing_host_verf.sin %>%
                      select(host_verifications, no_of_vf) %>%
                      group_by() %>%
                     slice(which.max(no_of_vf))
host_verf_list_string <- as.list(strsplit(as.character(max_verf.sin["host_verifications"]), ","))
host_verf_list_string
## [[1]]
##  [1] "'email'"                 "'phone'"                
##  [3] "'facebook'"              "'google'"               
##  [5] "'reviews'"               "'jumio'"                
##  [7] "'offline_government_id'" "'selfie'"               
##  [9] "'government_id'"         "'identity_manual'"      
## [11] "'work_email'"

Let’s take this list to generate dummy variables. > [‘email’, ‘phone’, ‘facebook’, ‘reviews’, ‘manual_offline’, ‘jumio’, ‘offline_government_id’, ‘government_id’, ‘work_email’]

Let’s generalise these two bits for all cities and create dummy variables for each one of them.

wrangle_amenities_hostvf <- function (listing)
{
  listing <- listing %>%
            mutate(amenities = str_replace(amenities, "\\[","")) %>%
            mutate(amenities = str_replace(amenities, "\\]","")) %>%
            mutate(amenities = str_replace_all(amenities, "\"","")) %>%
            mutate(amenities = str_replace_all(amenities, ", " ,",")) %>%
            mutate(amenities_list = as.list(strsplit(amenities, ","))) %>%
            mutate(no_of_am = lengths(amenities_list)) %>%
            mutate(Amenities_Wifi = as.numeric(grepl('Wifi', amenities, fixed = TRUE))) %>%
            mutate(Amenities_Shampoo = as.numeric(grepl('Shampoo', amenities, fixed = TRUE))) %>%
            mutate(Amenities_Kitchen = as.numeric(grepl('Kitchen', amenities, fixed = TRUE))) %>%
            mutate(Amenities_Long_Term = as.numeric(grepl('Long term stays', amenities, fixed = TRUE))) %>%
            mutate(Amenities_Washer = as.numeric(grepl('Washer', amenities, fixed = TRUE))) %>%
            mutate(Amenities_HairDryer = as.numeric(grepl('Hair dryer', amenities, fixed = TRUE))) %>%
            mutate(Amenities_HotWater = as.numeric(grepl('Hot water', amenities, fixed = TRUE))) %>%
            mutate(Amenities_TV = as.numeric(grepl('TV', amenities, fixed = TRUE))) %>%
            mutate(Amenities_AC = as.numeric(grepl('Air conditioning', amenities, fixed = TRUE))) %>%

            mutate(host_verifications = str_replace(host_verifications, "\\[","")) %>%
            mutate(host_verifications = str_replace(host_verifications, "\\]","")) %>%
            mutate(host_verifications = str_replace_all(host_verifications, "\"","")) %>%
            mutate(host_verifications = str_replace_all(host_verifications, ", " ,",")) %>%
            mutate(host_verifications_list = as.list(strsplit(host_verifications, ","))) %>%
            mutate(hv_email = as.numeric(grepl('email', host_verifications, fixed = TRUE))) %>%
            mutate(hv_phone = as.numeric(grepl('phone', host_verifications, fixed = TRUE))) %>%
            mutate(hv_facebook = as.numeric(grepl('facebook', host_verifications, fixed = TRUE))) %>%
            mutate(hv_reviews = as.numeric(grepl('reviews', host_verifications, fixed = TRUE))) %>%
            mutate(hv_manual_offline = as.numeric(grepl('manual_offline', host_verifications, fixed = TRUE))) %>%
            mutate(hv_manual_jumio = as.numeric(grepl('jumio', host_verifications, fixed = TRUE))) %>%
            mutate(hv_manual_off_gov = as.numeric(grepl('offline_government_id', host_verifications, fixed = TRUE))) %>%
            mutate(hv_manual_gov = as.numeric(grepl('government_id', host_verifications, fixed = TRUE))) %>%
            mutate(hv_manual_work_email = as.numeric(grepl('work_email', host_verifications, fixed = TRUE))) %>%
            mutate(no_of_vf = lengths(host_verifications_list))
}

listing.sin <- wrangle_amenities_hostvf(listing.sin)
listing.nrt <- wrangle_amenities_hostvf(listing.nrt)
listing.tpe <- wrangle_amenities_hostvf(listing.tpe)
listing.hkg <- wrangle_amenities_hostvf(listing.hkg)

1.3 Data Wrangling

1.3.1 Wrangling Listings

This is a function that wrangles AirBnb data into an analysable chunk. Because we will be doing the same for multiple cities, we will do a function out of this. The function is based on top of code shared in the lecture for Module 2. The obvious additions are the id column, neighbourhoods and dummy variables for amenities and host verification.

wrangle_airbnb_dataset <- function (raw_listing_full)
{
  listing.raw <- raw_listing_full  %>% 
                select(id, price,number_of_reviews,beds,bathrooms,accommodates,reviews_per_month, property_type, room_type, review_scores_rating, neighbourhood_cleansed, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, latitude, longitude, amenities, last_review, no_of_am, Amenities_Wifi, Amenities_Shampoo, Amenities_Kitchen, Amenities_Long_Term, Amenities_Washer, Amenities_HairDryer, Amenities_HotWater, Amenities_TV,Amenities_AC, host_verifications, hv_email,hv_phone, hv_facebook, hv_reviews, hv_manual_offline, hv_manual_jumio,hv_manual_off_gov, hv_manual_gov, hv_manual_work_email, no_of_vf) %>% 
                rename(Reviews = number_of_reviews) %>% 
                rename(Beds = beds) %>% 
                rename(Baths = bathrooms) %>% 
                rename(Capacity = accommodates) %>% 
                rename(Monthly_Reviews = reviews_per_month) %>% 
                rename(Property_Type = property_type) %>% 
                rename(Room_Type = room_type) %>% 
                rename(Price = price) %>% 
                rename(Rating = review_scores_rating) %>%
                # rename(Neighbourhood = neighbourhood_cleansed) %>%
                rename(host_Superhost = host_is_superhost)


  listing.raw <-  listing.raw %>% 
                mutate(Price = str_replace(Price, "[$]", "")) %>% 
                mutate(Price = str_replace(Price, "[,]", "")) %>% 
                mutate(Price = as.numeric(Price)) %>% 
                
                # mutate(hood_factor = as.factor(Neighbourhood)) %>%
                
                mutate(host_response_rate = str_replace(host_response_rate, "[%]", "")) %>%
                mutate(host_response_rate = as.numeric(host_response_rate)/100) %>%
                mutate(host_acceptance_rate = str_replace(host_acceptance_rate, "[%]", "")) %>%
                mutate(host_acceptance_rate = as.numeric(host_acceptance_rate)/100) %>%
                mutate(host_Superhost = ifelse(host_Superhost =="f", 0, 1)) %>%
    
                mutate(host_response_rate = factor(host_response_rate, levels = c("within a few hours", "within a day", "a few days or more"))) %>%
                mutate(host_response_hours = ifelse(host_response_rate == "within a few hours"),1,0) %>%
                mutate(host_response_day = ifelse(host_response_rate == "within a day"),1,0) %>%
                mutate(host_response_few_days = ifelse(host_response_rate == "a few days or more"),1,0) %>%
                
                mutate(last_review = as.Date(last_review)) %>%
                mutate(Days_since_last_review = as.numeric(difftime(as.Date("2021-12-31"), last_review, units="days"))) %>%
    
                mutate(Room_Type = factor(Room_Type, levels = c("Shared room", "Private room", "Entire home/apt"))) %>% 
                mutate(Capacity_Sqr = Capacity * Capacity) %>% 
                mutate(Beds_Sqr = Beds * Beds) %>% 
                mutate(Baths_Sqr = Baths * Baths) %>% 
                mutate(ln_Price = log(1+Price)) %>% 
                mutate(ln_Beds = log(1+Beds)) %>%
                mutate(ln_Baths = log(1+Baths)) %>% 
                mutate(ln_Capacity = log(1+Capacity)) %>% 
                mutate(ln_Rating = log(1+Rating)) %>% 
                mutate(Shared_ind = ifelse(Room_Type == "Shared room",1,0)) %>% 
                mutate(House_ind = ifelse(Room_Type == "Entire home/apt",1,0)) %>% 
                mutate(Private_ind = ifelse(Room_Type == "Private room",1,0)) %>% 
                mutate(Capacity_x_Shared_ind = Shared_ind * Capacity) %>% 
                mutate(H_Cap = House_ind * Capacity) %>% 
                mutate(P_Cap = Private_ind * Capacity) %>% 
                mutate(ln_Capacity_x_Shared_ind = Shared_ind * ln_Capacity) %>% 
                mutate(ln_Capacity_x_House_ind = House_ind * ln_Capacity) %>% 
                mutate(ln_Capacity_x_Private_ind = Private_ind * ln_Capacity) %>%
                
                filter(!is.na(Price))

  return(listing.raw)
}

list.sin <- wrangle_airbnb_dataset(listing.sin)
list.nrt <- wrangle_airbnb_dataset(listing.nrt)
list.tpe <- wrangle_airbnb_dataset(listing.tpe)
list.hkg <- wrangle_airbnb_dataset(listing.hkg)

1.3.2 Wrangling Reviews.

There’s value in understanding how many reviews a property has received in the last 12 months as a measure of how active a property is. The notion is that modelling price for active listings will be more accurate than modelling price for all listings.

The approach taken in this paper was to look listings active in the past 12 months. However, given restrictions because of pandemic, we felt it would be better to look at the period between 1 Jan 2019 and 31 Dec 2021, to include one year in addition to the two pandemic years, 2020 and 2021.

We check this by wrangling the review dataset.

count_reviews <- function(listings, reviews, from_date, to_date)
{
  reviews_grouped <- reviews %>%
                  mutate(date = as.Date(date)) %>%
                  filter(between(date, as.Date(from_date), as.Date(to_date))) %>%
                  group_by(listing_id) %>%
                  summarise(reviews_since_2019 = n()) %>%
                  mutate(bookings_since_2019 = reviews_since_2019*2) %>%
                  rename(id = listing_id)
  listings <- left_join(listings, reviews_grouped, by="id")
  return(listings)
}
start_date = "2019-1-1"
end_date = "2021-12-31"
list.sin <-count_reviews(list.sin, reviews.sin,start_date, end_date)
list.hkg <- count_reviews(list.hkg, reviews.hkg,start_date, end_date)
list.tpe <- count_reviews(list.tpe, reviews.tpe,start_date, end_date)
list.nrt <- count_reviews(list.nrt, reviews.nrt,start_date, end_date)

list_after_2019.sin <- list.sin %>% filter(!is.na(reviews_since_2019)) 
list_after_2019.tpe <- list.tpe %>% filter(!is.na(reviews_since_2019))
list_after_2019.nrt <- list.nrt %>% filter(!is.na(reviews_since_2019))
list_after_2019.hkg <- list.hkg %>% filter(!is.na(reviews_since_2019))

Let’s try to look listings after 2019 in map form.

generate_choropleth_by_city(list_after_2019.sin, map.sin, "Singapore")
generate_choropleth_by_city(list_after_2019.nrt, map.nrt, "Tokyo")
generate_choropleth_by_city(list_after_2019.hkg, map.hkg, "Hong Kong")
generate_choropleth_by_city(list_after_2019.tpe, map.tpe, "Taipei")
bar_charts_by_neighbourhood(list_after_2019.sin, "Singapore", neighbourhoods.sin)
##     neighbourhood_cleansed n
## 1              Marina East 0
## 2             Straits View 0
## 3                   Changi 0
## 4               Changi Bay 0
## 5               Paya Lebar 0
## 6    North-Eastern Islands 0
## 7                  Seletar 0
## 8                  Simpang 0
## 9             Sungei Kadut 0
## 10                Boon Lay 0
## 11                 Pioneer 0
## 12                  Tengah 0
## 13                    Tuas 0
## 14         Western Islands 0
## 15 Western Water Catchment 0
bar_charts_by_neighbourhood(list_after_2019.nrt, "Tokyo", neighbourhoods.nrt)
##    neighbourhood_cleansed n
## 1          Aogashima Mura 0
## 2               Fussa Shi 0
## 3           Hachijo Machi 0
## 4       Higashiyamato Shi 0
## 5            Hinode Machi 0
## 6           Hinohara Mura 0
## 7               Inagi Shi 0
## 8              Kiyose Shi 0
## 9          Kozushima Mura 0
## 10        Mikurajima Mura 0
## 11            Miyake Mura 0
## 12           Mizuho Machi 0
## 13           Niijima Mura 0
## 14         Ogasawara Mura 0
## 15           Oshima Machi 0
## 16           Toshima Mura 0
bar_charts_by_neighbourhood(list_after_2019.hkg, "Hong Kong", neighbourhoods.hkg)
## [1] neighbourhood_cleansed n                     
## <0 rows> (or 0-length row.names)
bar_charts_by_neighbourhood(list_after_2019.tpe, "Taipei", neighbourhoods.tpe)
## [1] neighbourhood_cleansed n                     
## <0 rows> (or 0-length row.names)
add_earnings <- function(listing)
{
  return (listing %>% mutate(earnings_since_2019 = bookings_since_2019 * 3 * Price))
}
list_after_2019.sin <- add_earnings(list_after_2019.sin) 
list_after_2019.tpe <- add_earnings(list_after_2019.tpe) 
list_after_2019.nrt <- add_earnings(list_after_2019.nrt) 
list_after_2019.hkg <- add_earnings(list_after_2019.hkg) 

Let’s group listings into groups of neighbourhoods: extremely popular, popular, moderate, not so popular, and sparse.

district_bins.sin <- bin_districts(list_after_2019.sin, bins=5)
district_bins.sin
##     neighbourhood_cleansed   n nb_group
## 1                  Geylang 224        5
## 2                  Kallang 221        5
## 3                   Outram 201        5
## 4                   Rochor 136        5
## 5            Downtown Core 134        5
## 6                    Bedok 101        5
## 7              Bukit Merah  94        5
## 8                   Novena  73        5
## 9             River Valley  67        4
## 10              Queenstown  52        4
## 11                 Tanglin  38        4
## 12         Singapore River  36        4
## 13             Jurong West  32        4
## 14           Marine Parade  27        4
## 15             Jurong East  24        4
## 16                 Orchard  24        4
## 17                  Newton  22        3
## 18               Woodlands  22        3
## 19             Bukit Timah  21        3
## 20               Serangoon  21        3
## 21                Clementi  17        3
## 22                  Bishan  16        3
## 23                Tampines  16        3
## 24                 Hougang  15        3
## 25               Toa Payoh  13        2
## 26                  Museum   9        2
## 27                 Punggol   9        2
## 28              Ang Mo Kio   8        2
## 29                  Yishun   8        2
## 30           Choa Chu Kang   7        2
## 31 Central Water Catchment   6        2
## 32               Pasir Ris   6        2
## 33             Bukit Batok   6        1
## 34           Bukit Panjang   6        1
## 35                Sengkang   5        1
## 36        Southern Islands   3        1
## 37            Marina South   2        1
## 38            Lim Chu Kang   1        1
## 39                  Mandai   1        1
## 40               Sembawang   1        1
district_bins.nrt <- bin_districts(list_after_2019.nrt, bins=5)
district_bins.nrt
##    neighbourhood_cleansed    n nb_group
## 1             Shinjuku Ku 1653        5
## 2                Taito Ku 1147        5
## 3               Sumida Ku  810        5
## 4              Toshima Ku  707        5
## 5              Shibuya Ku  509        5
## 6                  Ota Ku  357        5
## 7               Minato Ku  342        5
## 8                 Chuo Ku  330        5
## 9               Nakano Ku  255        5
## 10            Setagaya Ku  251        4
## 11          Katsushika Ku  216        4
## 12                Kita Ku  181        4
## 13            Suginami Ku  177        4
## 14             Arakawa Ku  153        4
## 15           Shinagawa Ku  137        4
## 16                Koto Ku  136        4
## 17             Edogawa Ku  135        4
## 18            Itabashi Ku  110        4
## 19             Chiyoda Ku  110        3
## 20              Bunkyo Ku  106        3
## 21              Adachi Ku   73        3
## 22              Meguro Ku   47        3
## 23              Nerima Ku   44        3
## 24           Hachioji Shi   18        3
## 25               Hino Shi   15        3
## 26            Machida Shi   14        3
## 27              Chofu Shi   12        3
## 28              Fuchu Shi   11        2
## 29          Kokubunji Shi   10        2
## 30             Mitaka Shi    9        2
## 31            Akiruno Shi    7        2
## 32    Higashimurayama Shi    7        2
## 33          Kunitachi Shi    7        2
## 34          Musashino Shi    7        2
## 35          Tachikawa Shi    7        2
## 36               Tama Shi    7        2
## 37         Nishitokyo Shi    6        1
## 38            Kodaira Shi    5        1
## 39                Ome Shi    5        1
## 40              Komae Shi    4        1
## 41             Hamura Shi    3        1
## 42    Musashimurayama Shi    3        1
## 43          Okutama Machi    3        1
## 44           Akishima Shi    2        1
## 45      Higashikurume Shi    2        1
## 46            Koganei Shi    2        1
district_bins.hkg <- bin_districts(list_after_2019.hkg, bins=5)
district_bins.hkg
##    neighbourhood_cleansed    n nb_group
## 1           Yau Tsim Mong 1208        5
## 2                Wan Chai  311        5
## 3       Central & Western  279        5
## 4                 Islands  211        4
## 5            Kowloon City   96        4
## 6                 Eastern   70        4
## 7               Yuen Long   66        3
## 8                   North   65        3
## 9                Sai Kung   47        3
## 10           Sham Shui Po   34        3
## 11                Sha Tin   24        2
## 12               Southern   24        2
## 13                 Tai Po   19        2
## 14               Tuen Mun   12        2
## 15              Kwun Tong    9        1
## 16              Tsuen Wan    4        1
## 17             Kwai Tsing    3        1
## 18           Wong Tai Sin    3        1
district_bins.tpe <- bin_districts(list_after_2019.tpe, bins=5)
district_bins.tpe
##    neighbourhood_cleansed   n nb_group
## 1                  萬華區 536        5
## 2                  中正區 475        5
## 3                  大安區 454        4
## 4                  中山區 407        4
## 5                  信義區 283        3
## 6                  大同區 153        3
## 7                  松山區 142        2
## 8                  士林區 119        2
## 9                  文山區  64        2
## 10                 北投區  54        1
## 11                 內湖區  49        1
## 12                 南港區  24        1
list_after_2019.sin <- left_join(list_after_2019.sin, district_bins.sin %>% select(neighbourhood_cleansed, nb_group), by="neighbourhood_cleansed")
list_after_2019.nrt <- left_join(list_after_2019.nrt, district_bins.nrt %>% select(neighbourhood_cleansed, nb_group), by="neighbourhood_cleansed")
list_after_2019.hkg <- left_join(list_after_2019.hkg, district_bins.hkg %>% select(neighbourhood_cleansed, nb_group), by="neighbourhood_cleansed")
list_after_2019.tpe <- left_join(list_after_2019.tpe, district_bins.tpe %>% select(neighbourhood_cleansed, nb_group), by="neighbourhood_cleansed")
# list_after_2019.sin
list_after_2019.sin <- dummy_cols(list_after_2019.sin, select_columns = "nb_group", remove_selected_columns = TRUE)
list_after_2019.nrt <- dummy_cols(list_after_2019.nrt, select_columns = "nb_group", remove_selected_columns = TRUE)
list_after_2019.tpe <- dummy_cols(list_after_2019.tpe, select_columns = "nb_group", remove_selected_columns = TRUE)
list_after_2019.hkg <- dummy_cols(list_after_2019.hkg, select_columns = "nb_group", remove_selected_columns = TRUE)

# list_after_2019.sin
# list_after_2019.nrt
# list_after_2019.hkg
# list_after_2019.tpe

This reduces the number of listings, and hopefully, quite a few outliers.

cities  <- c("Singapore", "Tokyo", "Taipei", "Hong Kong")
no_of_listings <- c(nrow(listing.sin), nrow(listing.nrt), nrow(listing.tpe), nrow(listing.hkg))
no_of_listings_after_2019 <- c(nrow(list_after_2019.sin), nrow(list_after_2019.nrt), nrow(list_after_2019.tpe), nrow(list_after_2019.hkg))
data <- data.frame(cities, no_of_listings, no_of_listings_after_2019)
no_of_listings.fig <- plot_ly(data, 
  x = cities,
  y = ~no_of_listings,
  type = "bar", 
  text = no_of_listings,
  name = "No of Listings (All Years)"
)
no_of_listings.fig <- no_of_listings.fig %>% add_trace(y = ~no_of_listings_after_2019, text= no_of_listings_after_2019, name = "No of Active Listings")
no_of_listings.fig <- no_of_listings.fig %>% layout(title ="No of Listings Per City", yaxis = list(title="No of Listings"))
no_of_listings.fig

This roughly halves the number of listings being considered in Hong Kong and Singapore, but not in Taipei and Tokyo.

2. Variable Selection: Exploratory Data Analysis

We now attempt to check on variables for each city.

2.1 Singapore

data_exploration <- function (listing)
{
  plot_str(listing, type="r")
  introduce(listing)
  plot_intro(listing)
  plot_missing(listing)
  plot_bar(listing)
  pca_df <- na.omit(list.sin[, c("Price", "Room_Type", "Reviews", "Beds", "Capacity", "Monthly_Reviews", "host_Superhost", "Rating")])#,"Days_since_last_review", "host_response_rate", "host_response_hours", "host_acceptance_rate","host_response_day", "host_response_few_days")])
  plot_qq(pca_df)
  plot_prcomp(pca_df, variance_cap = 0.9, nrow = 2L, ncol=2L)
}
data_exploration(list.sin)

## 4 columns ignored with more than 50 categories.
## Property_Type: 51 categories
## amenities: 2815 categories
## last_review: 962 categories
## host_verifications: 140 categories

Taipei

data_exploration(list.tpe)

## 4 columns ignored with more than 50 categories.
## Property_Type: 58 categories
## amenities: 3376 categories
## last_review: 1050 categories
## host_verifications: 158 categories

Hong Kong

data_exploration(list.hkg)

## 4 columns ignored with more than 50 categories.
## Property_Type: 69 categories
## amenities: 3846 categories
## last_review: 1125 categories
## host_verifications: 150 categories

Tokyo

data_exploration(list.nrt)

## 4 columns ignored with more than 50 categories.
## Property_Type: 64 categories
## amenities: 7467 categories
## last_review: 987 categories
## host_verifications: 192 categories

2.2 Boxplots

We will now check out outliers in our data for various parameters, filtering for listings that have seen at least one booking since 1 Jan 2019, starting with Singapore data.

generate_price_boxplot <- function (listing.clean, city, comparison_col = "")
{
  # png(file = "./graphs/boxplot.png")
  if (comparison_col == "")
  {
    boxplot(listing.clean$Price, data = listing.clean, ylab="Price", main=paste("Boxplot: Price for", city))
  }
  else
    boxplot(listing.clean$Price ~ listing.clean[[comparison_col]], data = listing.clean, ylab="Price", xlab=comparison_col, main=paste("Boxplot: Price vs", comparison_col, "for", city))
  # dev.off()
}

generate_price_boxplot(list_after_2019.sin, "Singapore") #, sin_listing.clean$)

generate_price_boxplot(list_after_2019.sin, "Singapore", "Room_Type") #, sin_listing.clean$)

generate_price_boxplot(list_after_2019.sin, "Singapore", "Property_Type") #, sin_listing.clean$)

generate_price_boxplot(list_after_2019.sin, "Singapore", "Capacity") #, sin_listing.clean$)

generate_price_boxplot(list_after_2019.sin, "Singapore", "Beds") #, sin_listing.clean$)

generate_price_boxplot(list_after_2019.sin, "Singapore", "neighbourhood_cleansed") #, sin_listing.clean$)

generate_price_boxplot(list_after_2019.sin, "Singapore", "Reviews") #, sin_listing.clean$)

Seems like a single boat in Bukit Merah area (possibly next to the marina at Keppel Bay) has a very high price, at $2500/ night. Let’s look that one up more closely.

# head(list_after_2019.sin %>% arrange(desc(Price)))
# head(list_after_2019.sin %>% arrange(desc(reviews_since_2019)))
head(list_after_2019.sin %>% filter(Property_Type == "Boat") %>% arrange (desc(reviews_since_2019)))
##         id Price Reviews Beds Baths Capacity Monthly_Reviews Property_Type
## 1 31527262   344     217    1    NA        2            6.24          Boat
## 2 37907711   199     177    3    NA        4            6.23          Boat
## 3 20247516  2500      55    4    NA        5            1.05          Boat
## 4 50433019   288       7    2    NA        5            2.50          Boat
##         Room_Type Rating neighbourhood_cleansed host_response_time
## 1 Entire home/apt   4.94       Southern Islands     within an hour
## 2 Entire home/apt   4.49                Punggol     within an hour
## 3 Entire home/apt   4.74            Bukit Merah       within a day
## 4 Entire home/apt   5.00                Punggol within a few hours
##   host_response_rate host_acceptance_rate host_Superhost latitude longitude
## 1               <NA>                 1.00              1  1.24535  103.8387
## 2               <NA>                 0.96              0  1.41585  103.9001
## 3               <NA>                 0.98              0  1.26520  103.8190
## 4               <NA>                 0.92              0  1.41480  103.8986
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           amenities
## 1                                                                                                                                                                                  Toaster,Sound system,Hangers,Bed linens,Hot water kettle,Coffee maker,Carbon monoxide alarm,Hair dryer,TV,Outdoor furniture,Dining table,Security cameras on property,Outdoor dining area,Private entrance,Refrigerator,Microwave,Waterfront,Wifi,Smoke alarm,Shampoo,Portable fans,Paid parking off premises,Hot water,Mini fridge,Essentials,First aid kit,Dishes and silverware,Air conditioning,Shower gel,Fire extinguisher
## 2                                                                                                                                                                                                                 Sound system,Hangers,Bed linens,Coffee maker,Hair dryer,TV,Paid parking on premises,Lockbox,Room-darkening shades,Security cameras on property,Long term stays allowed,Patio or balcony,Pour-over coffee,Microwave,Waterfront,Wifi,Smoke alarm,Shampoo,Extra pillows and blankets,Hot water,Mini fridge,Essentials,Kitchen,EV charger,First aid kit,Air conditioning,Shower gel,Fire extinguisher
## 3                                                                                                                                                                                                                                                                                                                                                                                                                             Shampoo,Essentials,Long term stays allowed,Carbon monoxide alarm,Hair dryer,Host greets you,Hot water,Pool,TV,Paid parking on premises,Air conditioning,Smoke alarm,Fire extinguisher
## 4 Toaster,Bidet,Sound system,Rice maker,Hangers,Bed linens,Hot water kettle,Cooking basics,Freezer,Washer,Bathtub,Hair dryer,TV,Outdoor furniture,Dining table,Dedicated workspace,Free parking on premises,Lockbox,Cleaning products,Clothing storage: closet,dresser,and wardrobe,Long term stays allowed,Outdoor dining area,Private entrance,Pocket wifi,Refrigerator,Waterfront,Wifi,Smoke alarm,Induction stove,Dishwasher,Extra pillows and blankets,Portable fans,Hot water,Mini fridge,Iron,Essentials,Boat slip,Kitchen,EV charger,First aid kit,Air conditioning,Dishes and silverware,Fire extinguisher
##   last_review no_of_am Amenities_Wifi Amenities_Shampoo Amenities_Kitchen
## 1  2021-09-15       30              1                 1                 0
## 2  2021-12-12       28              1                 1                 1
## 3  2020-07-28       13              0                 1                 0
## 4  2021-12-07       45              1                 0                 1
##   Amenities_Long_Term Amenities_Washer Amenities_HairDryer Amenities_HotWater
## 1                   0                0                   1                  1
## 2                   1                0                   1                  1
## 3                   1                0                   1                  1
## 4                   1                1                   1                  1
##   Amenities_TV Amenities_AC
## 1            1            1
## 2            1            1
## 3            1            1
## 4            1            1
##                                                                           host_verifications
## 1 'email','phone','jumio','offline_government_id','selfie','government_id','identity_manual'
## 2 'email','phone','jumio','offline_government_id','selfie','government_id','identity_manual'
## 3               'email','phone','reviews','jumio','selfie','government_id','identity_manual'
## 4                                                                                    'phone'
##   hv_email hv_phone hv_facebook hv_reviews hv_manual_offline hv_manual_jumio
## 1        1        1           0          0                 0               1
## 2        1        1           0          0                 0               1
## 3        1        1           0          1                 0               1
## 4        0        1           0          0                 0               0
##   hv_manual_off_gov hv_manual_gov hv_manual_work_email no_of_vf
## 1                 1             1                    0        7
## 2                 1             1                    0        7
## 3                 0             1                    0        7
## 4                 0             0                    0        1
##   host_response_hours 1 0 host_response_day host_response_few_days
## 1                  NA 1 0                NA                     NA
## 2                  NA 1 0                NA                     NA
## 3                  NA 1 0                NA                     NA
## 4                  NA 1 0                NA                     NA
##   Days_since_last_review Capacity_Sqr Beds_Sqr Baths_Sqr ln_Price   ln_Beds
## 1                    107            4        1        NA 5.843544 0.6931472
## 2                     19           16        9        NA 5.298317 1.3862944
## 3                    521           25       16        NA 7.824446 1.6094379
## 4                     24           25        4        NA 5.666427 1.0986123
##   ln_Baths ln_Capacity ln_Rating Shared_ind House_ind Private_ind
## 1       NA    1.098612  1.781709          0         1           0
## 2       NA    1.609438  1.702928          0         1           0
## 3       NA    1.791759  1.747459          0         1           0
## 4       NA    1.791759  1.791759          0         1           0
##   Capacity_x_Shared_ind H_Cap P_Cap ln_Capacity_x_Shared_ind
## 1                     0     2     0                        0
## 2                     0     4     0                        0
## 3                     0     5     0                        0
## 4                     0     5     0                        0
##   ln_Capacity_x_House_ind ln_Capacity_x_Private_ind reviews_since_2019
## 1                1.098612                         0                217
## 2                1.609438                         0                177
## 3                1.791759                         0                 26
## 4                1.791759                         0                  7
##   bookings_since_2019 earnings_since_2019 nb_group_1 nb_group_2 nb_group_3
## 1                 434              447888          1          0          0
## 2                 354              211338          0          1          0
## 3                  52              390000          0          0          0
## 4                  14               12096          0          1          0
##   nb_group_4 nb_group_5
## 1          0          0
## 2          0          0
## 3          0          1
## 4          0          0
head(list_after_2019.sin %>% group_by(id, Property_Type, bookings_since_2019) %>% summarise(percent_of_total = bookings_since_2019*100/sum(list_after_2019.sin$bookings_since_2019)) %>% filter(Property_Type == "Boat") %>% arrange (desc(bookings_since_2019)))
## `summarise()` has grouped output by 'id', 'Property_Type'. You can override
## using the `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   id, Property_Type [4]
##         id Property_Type bookings_since_2019 percent_of_total
##      <int> <chr>                       <dbl>            <dbl>
## 1 31527262 Boat                          434           0.967 
## 2 37907711 Boat                          354           0.788 
## 3 20247516 Boat                           52           0.116 
## 4 50433019 Boat                           14           0.0312

There are four boats listed on Airbnb Singapore. Together, they form roughly 2% of all bookings since 2019.

2.3 Correlation Matrices

list_of_vars = c("earnings_since_2019","Rating", "Reviews", "Beds", "Capacity", "host_acceptance_rate", "host_Superhost","Amenities_Wifi","Amenities_Shampoo","Amenities_Kitchen","Amenities_Long_Term","Amenities_Washer","Amenities_HairDryer", "Amenities_HotWater", "Amenities_TV", "Amenities_AC", "hv_email",  "hv_reviews", "Shared_ind", "House_ind", "Private_ind")#, "reviews_since_2019","bookings_since_2019") #, "hood_factor")
# list_after_2019.sin %>% select_(.dots = c(list_of_vars), "Price")
vars_list.sin = list_after_2019.sin %>% select_(.dots = c(list_of_vars,"Price")) %>% na.omit()
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
vars_list.tpe = list_after_2019.tpe %>% select_(.dots = c(list_of_vars,"Price")) %>% na.omit()
vars_list.hkg = list_after_2019.hkg %>% select_(.dots = c(list_of_vars,"Price")) %>% na.omit()
vars_list.nrt = list_after_2019.nrt %>% select_(.dots = c(list_of_vars,"Price")) %>% na.omit()

# vars_list.sin
paint_corrleations <- function(listing)
{
  # chart.Correlation(listing, histogram=TRUE, pch=19)
  corrplot::corrplot(cor(listing, use = "complete.obs"), method="square", type="lower")
}
paint_corrleations(vars_list.sin)

paint_corrleations(vars_list.nrt)

paint_corrleations(vars_list.tpe)

paint_corrleations(vars_list.hkg)

3 Modelling

3.1 Principal Components Regression - Predicting Price

Principal Components Regression could find M linear combinations (“principal components”) of our predictors (list_of_vars) and then use least squares to fit a linear regression model.

set.seed(1)

pcr_model <- function (listings, city)
{
  pcr_model <- pcr( data=listings, scale=TRUE, validation="CV", Price ~ reviews_since_2019 + Rating + host_acceptance_rate +host_Superhost + reviews_since_2019 + Shared_ind + House_ind + Private_ind + Amenities_Wifi + hv_email) 
  
  summary(pcr_model)
  plot(pcr_model)
  validationplot(pcr_model, val.type="MSEP")
  validationplot(pcr_model, val.type="R2")
  print(paste("MAE for", city,":", mae(listings$Price, predict(pcr_model))))
  return (pcr_model)
}
pcr_model.sin <-pcr_model(list_after_2019.sin, "Singapore")
## Data:    X dimension: 1405 9 
##  Y dimension: 1405 1
## Fit method: svdpc
## Number of components considered: 9
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           152.6    142.9    142.1    141.6    142.4    142.2    142.3
## adjCV        152.6    142.9    142.0    141.6    142.3    142.1    142.2
##        7 comps  8 comps  9 comps
## CV       140.8    140.7    140.6
## adjCV    140.7    140.6    140.5
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X        22.14    40.73    53.25    64.19    74.57    84.27    92.71   100.00
## Price    12.89    13.94    14.45    14.45    15.23    15.23    16.90    17.21
##        9 comps
## X       100.00
## Price    17.21

## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length

## [1] "MAE for Singapore : 108.772298292697"
pcr_model.hkg <-pcr_model(list_after_2019.hkg, "Hong Kong")
## Data:    X dimension: 1945 9 
##  Y dimension: 1945 1
## Fit method: svdpc
## Number of components considered: 9
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            2455     2455     2455     2454     2455     2456     2457
## adjCV         2455     2455     2455     2454     2455     2456     2456
##        7 comps  8 comps  9 comps
## CV        2458     2455     2455
## adjCV     2457     2454     2455
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps   8 comps
## X      21.7730  37.2315  50.3304  62.1254  72.9752  83.0920  91.9797  100.0000
## Price   0.2567   0.2572   0.4119   0.4395   0.4482   0.5174   0.5345    0.8255
##         9 comps
## X      100.0000
## Price    0.8256

## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length

## [1] "MAE for Hong Kong : 869.547093791851"
pcr_model.tpe <-pcr_model(list_after_2019.tpe, "Taipei")
## Data:    X dimension: 2175 9 
##  Y dimension: 2175 1
## Fit method: svdpc
## Number of components considered: 9
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4229     4170     4170     4153     4154     4156     4148
## adjCV         4229     4170     4169     4153     4153     4156     4148
##        7 comps  8 comps  9 comps
## CV        4134     4131     4132
## adjCV     4133     4130     4131
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       24.408   39.493   52.740   64.281   74.915   84.836   93.121  100.000
## Price    2.953    3.029    3.792    3.795    3.795    4.154    4.932    5.071
##        9 comps
## X      100.000
## Price    5.078

## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length

## [1] "MAE for Taipei : 2209.48295156008"
pcr_model.nrt <-pcr_model(list_after_2019.nrt, "Tokyo")
## Data:    X dimension: 7249 9 
##  Y dimension: 7249 1
## Fit method: svdpc
## Number of components considered: 9
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           29610    29516    29490    29496    29487    29489    29492
## adjCV        29610    29516    29489    29495    29486    29488    29491
##        7 comps  8 comps  9 comps
## CV       29492    29494    29495
## adjCV    29490    29492    29493
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps   8 comps
## X      22.2622  37.4091  49.4255  61.1240  71.8952  82.0212  91.8955  100.0000
## Price   0.6487   0.8555   0.8623   0.9325   0.9325   0.9327   0.9626    0.9718
##         9 comps
## X      100.0000
## Price    0.9733

## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length

## [1] "MAE for Tokyo : 12041.061870605"

The mean absolute error for each city is about $108.

3.2 Predicting Earnings

set.seed(1)

pcr_model_earnings <- function (listings, city)
{
  pcr_model <- pcr( data=listings, scale=TRUE, validation="CV", earnings_since_2019 ~ Price + reviews_since_2019 + Rating + host_acceptance_rate +host_Superhost + reviews_since_2019 + Shared_ind + House_ind + Private_ind + Amenities_Wifi + hv_email) #", 
                   # "host_Superhost", "no_of_am","Amenities_Wifi","Amenities_Shampoo","Amenities_Kitchen","Amenities_Long_Term","Amenities_Washer",
                   # "Amenities_HairDryer", "Amenities_HotWater", "Amenities_TV", "Amenities_AC", "hv_email",  "hv_facebook", "hv_reviews",
                   # "hv_manual_offline", "hv_manual_jumio", "hv_manual_off_gov", "hv_manual_gov", "hv_manual_work_email", "no_of_vf", "Days_since_last_review",
                   # , "reviews_since_2019","bookings_since_2019")
  
  summary(pcr_model)
  plot(pcr_model)
  validationplot(pcr_model, val.type="MSEP")
  validationplot(pcr_model, val.type="R2")
  print(paste("MAE for", city,":", mae(listings$earnings_since_2019, predict(pcr_model))))
  return (pcr_model)
}
pcr_model.sin <-pcr_model_earnings(list_after_2019.sin, "Singapore")
## Data:    X dimension: 1405 10 
##  Y dimension: 1405 1
## Fit method: svdpc
## Number of components considered: 10
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           31428    29557    28372    28288    25678    25378    22925
## adjCV        31428    29554    28366    28286    24823    25441    23114
##        7 comps  8 comps  9 comps  10 comps
## CV       17510    17465    16439     16436
## adjCV    17453    17414    16395     16391
## 
## TRAINING: % variance explained
##                      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X                      22.14    38.99     50.4    60.24    69.93    78.66
## earnings_since_2019    11.70    18.99     19.6    37.12    41.06    52.10
##                      7 comps  8 comps  9 comps  10 comps
## X                      87.03    93.82   100.00    100.00
## earnings_since_2019    71.88    72.12    75.15     75.15

## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length

## [1] "MAE for Singapore : 20085.8644280392"
pcr_model.hkg <-pcr_model_earnings(list_after_2019.hkg, "Hong Kong")
## Data:    X dimension: 1945 10 
##  Y dimension: 1945 1
## Fit method: svdpc
## Number of components considered: 10
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          410901   402205   390090   377536   381124   376353   341603
## adjCV       410901   402288   390081   377417   381617   367444   338156
##        7 comps  8 comps  9 comps  10 comps
## CV      346454   318750   295779    295967
## adjCV   343001   315592   292494    292664
## 
## TRAINING: % variance explained
##                      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X                     19.648    33.56    45.45    56.10    66.09    75.83
## earnings_since_2019    3.538    10.11    15.39    15.39    48.30    49.76
##                      7 comps  8 comps  9 comps  10 comps
## X                      84.87    92.86   100.00    100.00
## earnings_since_2019    49.91    58.56    65.66     65.66

## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length

## [1] "MAE for Hong Kong : 200007.767176238"
pcr_model.tpe <-pcr_model_earnings(list_after_2019.tpe, "Taipei")
## Data:    X dimension: 2175 10 
##  Y dimension: 2175 1
## Fit method: svdpc
## Number of components considered: 10
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV         1181326  1104149  1103314  1071703  1072319   990895   976066
## adjCV      1181326  1104158  1103341  1071364  1073671   959962   972962
##        7 comps  8 comps  9 comps  10 comps
## CV      967203   780310   753804    752060
## adjCV   964362   776832   750544    748890
## 
## TRAINING: % variance explained
##                      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X                      22.49    36.10    48.35    58.74    68.46    78.03
## earnings_since_2019    12.47    12.77    18.39    18.73    38.85    38.90
##                      7 comps  8 comps  9 comps  10 comps
## X                      86.66    93.84   100.00    100.00
## earnings_since_2019    42.44    62.81    65.52     65.52

## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length

## [1] "MAE for Taipei : 621874.28842328"
pcr_model.nrt <-pcr_model_earnings(list_after_2019.nrt, "Tokyo")
## Data:    X dimension: 7249 10 
##  Y dimension: 7249 1
## Fit method: svdpc
## Number of components considered: 10
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV         7764047  7613827  7340841  7216238  7147811  5899329  5450582
## adjCV      7764047  7613976  7341342  7222899  7149043  5652462  5440735
##        7 comps  8 comps  9 comps  10 comps
## CV     4892522  4785782  4762564   4762764
## adjCV  4863554  4777938  4754861   4755034
## 
## TRAINING: % variance explained
##                      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X                     20.165    33.87    44.69    55.31    65.04    74.74
## earnings_since_2019    3.815    10.69    14.11    15.84    50.72    53.89
##                      7 comps  8 comps  9 comps  10 comps
## X                      83.85    92.71   100.00    100.00
## earnings_since_2019    63.09    64.32    64.64     64.64

## Warning in actual - predicted: longer object length is not a multiple of shorter
## object length

## [1] "MAE for Tokyo : 3905051.77520906"

This doesn’t work

# lm0 <- lm(Price ~ Capacity, data = sin_listing.clean)
# summary(lm0)
# stargazer(lm0, type = "text")
#
# ggplot(data = sin_listing.clean, aes(x = Capacity, y = Price)) + geom_point(aes(size=3)) +
# scale_colour_hue(l=50) + # Use a slightly darker palette than normal
# geom_smooth(method=lm,   # Add linear regression lines
#            se=TRUE,    #  add shaded confidence region
#            fullrange=TRUE) +
# theme(axis.text.x = element_text(size=15), axis.text.y = element_text(size=15),
#         axis.title=element_text(size=15,face="bold"))
# # vif(lm0)
# ```
# ```{r}
#
# # The moderating effect of type of room. Lets model that.
#
# lm1 <- lm(Price ~ Private_ind + House_ind, data = sin_listing.clean)
# summary(lm1)
# stargazer(lm1, type = "text")
# ```
# ```{r}
# #Regression with Capacity and Dummy Variables for type of room:
#
# lm2 <- lm(Price ~ Capacity + Private_ind + House_ind, data = sin_listing.clean)
# summary(lm2)
# stargazer(lm2, type = "text")
#Regression with Capacity,Dummy Variables and interaction between the two:
# lm3 <- lm(Price ~ Capacity+Private_ind + House_ind+P_Cap+H_Cap, data = sin_listing.clean)
# summary(lm3)
# stargazer(lm3,type = "text")

Define function to clean

Comments from Kevin: - I have arbitrarily selected most of the variables, removing those which do not make sense (e.g. id, latitude) and those completely N.A. (e.g Baths) - Instead of imputing, I have removed rows with N.A. In the case of SG data, the entries goes from 1725 to 1373.

### For checking number of missing data
# md.pattern(list_after_2019.country)

clean_subset <- function(list_after_2019.country) {
  
### Arbitrary selection of a list of variables 
  selecting_columns <- list_after_2019.sin[,c("Price","Reviews","Beds","Capacity","Monthly_Reviews","Property_Type","Room_Type","Rating","neighbourhood_cleansed","host_response_time","host_acceptance_rate","host_Superhost","no_of_am","Amenities_Wifi","Amenities_Shampoo","Amenities_Kitchen","Amenities_Long_Term","Amenities_Washer","Amenities_HairDryer","Amenities_HotWater","Amenities_TV","Amenities_AC","hv_email","hv_phone","hv_facebook","hv_reviews","hv_manual_offline","hv_manual_jumio","hv_manual_off_gov","hv_manual_gov","hv_manual_work_email","no_of_vf","Days_since_last_review","Capacity_Sqr","Beds_Sqr","ln_Price","ln_Beds","ln_Capacity","ln_Rating","Shared_ind","House_ind","Private_ind","Capacity_x_Shared_ind","H_Cap","P_Cap","ln_Capacity_x_Shared_ind","ln_Capacity_x_House_ind","ln_Capacity_x_Private_ind","reviews_since_2019","bookings_since_2019", "earnings_since_2019" )]

### Removing rows with blanks instead of imputing
selecting_columns <- na.omit(selecting_columns)
selecting_columns$Property_Type <- as.factor(selecting_columns$Property_Type)
selecting_columns$neighbourhood_cleansed  <- as.factor(selecting_columns$neighbourhood_cleansed )
selecting_columns$host_response_time <- as.factor(selecting_columns$host_response_time)

  return(selecting_columns)
}

list_after_2019.sin_clean <- clean_subset(list_after_2019.sin)

Feature Selection - Backward Stepwise Regression

Comments from Kevin: Can help to double check if the R-squared value looks too high at 0.80 for a real-world dataset?

#Define Smallest and Full Model 
minmod = lm(Price~1, data = list_after_2019.sin_clean)
fullmod = lm(Price~. , data = list_after_2019.sin_clean)

# Using BIC: k=log(nobs(fullmod), Using AIC: k=2
backward_regression_model <- step(fullmod, scope = list(lower = minmod, upper = fullmod),direction = "backward", k=log(nobs(fullmod)), trace=F)
summary(backward_regression_model)
## 
## Call:
## lm(formula = Price ~ Reviews + no_of_am + Amenities_Wifi + Amenities_Shampoo + 
##     Amenities_Kitchen + Amenities_TV + no_of_vf + ln_Price + 
##     ln_Capacity + H_Cap + ln_Capacity_x_Shared_ind + ln_Capacity_x_House_ind + 
##     reviews_since_2019 + earnings_since_2019, data = list_after_2019.sin_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -385.43  -30.80  -10.79   19.62 1067.22 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -6.257e+02  2.382e+01 -26.269  < 2e-16 ***
## Reviews                   4.980e-01  7.689e-02   6.477 1.31e-10 ***
## no_of_am                  1.692e+00  2.281e-01   7.418 2.09e-13 ***
## Amenities_Wifi           -6.285e+01  1.622e+01  -3.874 0.000112 ***
## Amenities_Shampoo        -2.378e+01  4.258e+00  -5.586 2.81e-08 ***
## Amenities_Kitchen        -2.062e+01  5.487e+00  -3.758 0.000178 ***
## Amenities_TV             -1.695e+01  5.063e+00  -3.347 0.000839 ***
## no_of_vf                  3.267e+00  1.019e+00   3.206 0.001377 ** 
## ln_Price                  1.839e+02  4.233e+00  43.442  < 2e-16 ***
## ln_Capacity              -3.195e+01  7.289e+00  -4.384 1.25e-05 ***
## H_Cap                     1.722e+01  3.167e+00   5.437 6.41e-08 ***
## ln_Capacity_x_Shared_ind  6.649e+01  7.385e+00   9.003  < 2e-16 ***
## ln_Capacity_x_House_ind  -5.521e+01  8.661e+00  -6.375 2.50e-10 ***
## reviews_since_2019       -2.630e+00  1.827e-01 -14.392  < 2e-16 ***
## earnings_since_2019       1.863e-03  1.094e-04  17.027  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.8 on 1358 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8096 
## F-statistic: 417.6 on 14 and 1358 DF,  p-value: < 2.2e-16
# Putting forward stepwise regression in comments in case we need to use 
# forward_regression = step(minmod, scope = list(lower = minmod, upper = fullmod),direction = "forward", k=log(nobs(fullmod)), trace=F)
# summary(forward_regression)

Training the Model

set.seed(123)
# Splitting the data into 80:20
train_idx <- sample(1:nrow(list_after_2019.sin_clean), 0.8*nrow(list_after_2019.sin_clean)) 
training_sin <- list_after_2019.sin_clean[train_idx,]
testing_sin <- list_after_2019.sin_clean[-train_idx,]

final_model_sin <- lm(formula = Price ~ ln_Price + Property_Type + ln_Capacity_x_Private_ind + 
    Capacity + Amenities_Wifi + Amenities_Shampoo + Amenities_AC + 
    no_of_am + Amenities_Kitchen + hv_email, data = training_sin)
summary(final_model_sin)
## 
## Call:
## lm(formula = Price ~ ln_Price + Property_Type + ln_Capacity_x_Private_ind + 
##     Capacity + Amenities_Wifi + Amenities_Shampoo + Amenities_AC + 
##     no_of_am + Amenities_Kitchen + hv_email, data = training_sin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -410.08  -32.78   -7.09   24.00 1196.07 
## 
## Coefficients:
##                                                   Estimate Std. Error t value
## (Intercept)                                      -424.8837    51.2040  -8.298
## ln_Price                                          219.6829     5.0341  43.639
## Property_TypeCampsite                            -595.0186    60.3870  -9.853
## Property_TypeEntire condominium (condo)          -372.3346    37.4185  -9.951
## Property_TypeEntire guest suite                  -338.4095    63.2327  -5.352
## Property_TypeEntire guesthouse                   -357.4847    81.3368  -4.395
## Property_TypeEntire loft                         -400.6474    81.1304  -4.938
## Property_TypeEntire place                        -382.1517    56.0709  -6.816
## Property_TypeEntire rental unit                  -362.0118    37.5349  -9.645
## Property_TypeEntire residential home             -382.4777    41.7566  -9.160
## Property_TypeEntire serviced apartment           -345.4811    37.4602  -9.223
## Property_TypeEntire townhouse                    -411.3871    63.1003  -6.520
## Property_TypePrivate room                        -258.9714    53.9745  -4.798
## Property_TypePrivate room in bed and breakfast   -257.4717    55.3228  -4.654
## Property_TypePrivate room in bungalow            -241.7481    45.8062  -5.278
## Property_TypePrivate room in condominium (condo) -205.8855    41.1581  -5.002
## Property_TypePrivate room in guest suite         -227.2684    82.9654  -2.739
## Property_TypePrivate room in hostel              -232.9576    55.0536  -4.231
## Property_TypePrivate room in loft                -208.0485    51.7502  -4.020
## Property_TypePrivate room in rental unit         -206.8144    40.4686  -5.110
## Property_TypePrivate room in residential home    -199.0148    41.7347  -4.769
## Property_TypePrivate room in serviced apartment  -264.9364    45.1590  -5.867
## Property_TypePrivate room in townhouse           -218.4113    44.5380  -4.904
## Property_TypePrivate room in villa               -234.0597    52.7938  -4.433
## Property_TypeRoom in aparthotel                  -241.6445    82.4276  -2.932
## Property_TypeRoom in boutique hotel              -284.8855    41.1833  -6.918
## Property_TypeRoom in hotel                       -267.9793    41.7109  -6.425
## Property_TypeShared room                         -186.7206    49.1121  -3.802
## Property_TypeShared room in bed and breakfast    -200.6376    44.9283  -4.466
## Property_TypeShared room in boutique hotel       -267.6705    56.5970  -4.729
## Property_TypeShared room in hostel               -246.8231    42.8733  -5.757
## Property_TypeShared room in rental unit          -176.6594    53.0975  -3.327
## Property_TypeShared room in residential home     -173.6442    64.7973  -2.680
## Property_TypeTent                                -545.9528    82.9690  -6.580
## Property_TypeTiny house                          -387.0692    81.1098  -4.772
## ln_Capacity_x_Private_ind                         -77.3574    12.2706  -6.304
## Capacity                                            6.2504     1.4168   4.412
## Amenities_Wifi                                   -136.0849    22.9387  -5.933
## Amenities_Shampoo                                 -25.2203     5.6615  -4.455
## Amenities_AC                                      -37.5466    17.6964  -2.122
## no_of_am                                            0.9322     0.2995   3.112
## Amenities_Kitchen                                 -21.4168     7.7589  -2.760
## hv_email                                           29.3148     9.7303   3.013
##                                                  Pr(>|t|)    
## (Intercept)                                      3.21e-16 ***
## ln_Price                                          < 2e-16 ***
## Property_TypeCampsite                             < 2e-16 ***
## Property_TypeEntire condominium (condo)           < 2e-16 ***
## Property_TypeEntire guest suite                  1.07e-07 ***
## Property_TypeEntire guesthouse                   1.22e-05 ***
## Property_TypeEntire loft                         9.16e-07 ***
## Property_TypeEntire place                        1.58e-11 ***
## Property_TypeEntire rental unit                   < 2e-16 ***
## Property_TypeEntire residential home              < 2e-16 ***
## Property_TypeEntire serviced apartment            < 2e-16 ***
## Property_TypeEntire townhouse                    1.09e-10 ***
## Property_TypePrivate room                        1.83e-06 ***
## Property_TypePrivate room in bed and breakfast   3.67e-06 ***
## Property_TypePrivate room in bungalow            1.59e-07 ***
## Property_TypePrivate room in condominium (condo) 6.63e-07 ***
## Property_TypePrivate room in guest suite         0.006261 ** 
## Property_TypePrivate room in hostel              2.52e-05 ***
## Property_TypePrivate room in loft                6.23e-05 ***
## Property_TypePrivate room in rental unit         3.81e-07 ***
## Property_TypePrivate room in residential home    2.12e-06 ***
## Property_TypePrivate room in serviced apartment  5.94e-09 ***
## Property_TypePrivate room in townhouse           1.09e-06 ***
## Property_TypePrivate room in villa               1.02e-05 ***
## Property_TypeRoom in aparthotel                  0.003445 ** 
## Property_TypeRoom in boutique hotel              7.96e-12 ***
## Property_TypeRoom in hotel                       2.00e-10 ***
## Property_TypeShared room                         0.000152 ***
## Property_TypeShared room in bed and breakfast    8.84e-06 ***
## Property_TypeShared room in boutique hotel       2.56e-06 ***
## Property_TypeShared room in hostel               1.12e-08 ***
## Property_TypeShared room in rental unit          0.000908 ***
## Property_TypeShared room in residential home     0.007481 ** 
## Property_TypeTent                                7.39e-11 ***
## Property_TypeTiny house                          2.08e-06 ***
## ln_Capacity_x_Private_ind                        4.25e-10 ***
## Capacity                                         1.13e-05 ***
## Amenities_Wifi                                   4.04e-09 ***
## Amenities_Shampoo                                9.30e-06 ***
## Amenities_AC                                     0.034095 *  
## no_of_am                                         0.001908 ** 
## Amenities_Kitchen                                0.005875 ** 
## hv_email                                         0.002651 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.22 on 1055 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7952 
## F-statistic: 102.4 on 42 and 1055 DF,  p-value: < 2.2e-16

Checking Predictive Power and Mean absolute prediction error (MAE)

Comments from Kevin: Calculated the MAE value, but this number is probably more useful if its a comparison between models.

predict_sin <- predict(final_model_sin,newdata=testing_sin)
MAE_sin <- mean(abs(predict_sin - testing_sin$Price))

Checking for Assumptions on Outliers and Multicollinearity

Comments from Kevin: There seems to be out-lying points and issue of multicollinearity - we could either try to resolve or in the interest of time, just state that this is subsequently something we could look into

cook=cooks.distance(final_model_sin)
plot(cook,type='h',lwd=3,ylab="Cook's Distance")

vif(final_model_sin)
##                                 GVIF Df GVIF^(1/(2*Df))
## ln_Price                    2.920011  1        1.708804
## Property_Type             125.889002 33        1.076014
## ln_Capacity_x_Private_ind  10.885166  1        3.299268
## Capacity                    2.055544  1        1.433717
## Amenities_Wifi              1.492684  1        1.221754
## Amenities_Shampoo           1.489170  1        1.220316
## Amenities_AC                1.294541  1        1.137779
## no_of_am                    1.562050  1        1.249820
## Amenities_Kitchen           1.895419  1        1.376742
## hv_email                    1.110136  1        1.053630